41 research outputs found

    Oversampling for Imbalanced Learning Based on K-Means and SMOTE

    Learning from class-imbalanced data continues to be a common and challenging problem in supervised learning, as standard classification algorithms are designed to handle balanced class distributions. While different strategies exist to tackle this problem, methods which generate artificial data to achieve a balanced class distribution are more versatile than modifications to the classification algorithm. Such techniques, called oversamplers, modify the training data, allowing any classifier to be used with class-imbalanced datasets. Many algorithms have been proposed for this task, but most are complex and tend to generate unnecessary noise. This work presents a simple and effective oversampling method based on k-means clustering and SMOTE oversampling, which avoids the generation of noise and effectively overcomes imbalances between and within classes. Empirical results of extensive experiments with 71 datasets show that training data oversampled with the proposed method improves classification results. Moreover, k-means SMOTE consistently outperforms other popular oversampling methods. An implementation is made available in the Python programming language.
    Comment: 19 pages, 8 figures
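The SMOTE step that this abstract builds on can be illustrated with a minimal sketch. It shows only the generic interpolation idea (a synthetic point is placed on the segment between a minority sample and one of its nearest minority neighbours) and omits the k-means clustering and cluster-filtering stages of k-means SMOTE. The function name `smote_sample`, its parameters, and the toy data are hypothetical and are not taken from the authors' implementation.

```python
import math
import random


def smote_sample(minority, k=2, n_new=4, seed=0):
    """Generate n_new synthetic points by SMOTE-style interpolation.

    Hypothetical helper for illustration; not the paper's implementation.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy minority class with two continuous features (assumed data)
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_sample(minority)
```

Because every synthetic point is a convex combination of two existing minority samples, it always lies inside the bounding box of the minority class. In k-means SMOTE, as the abstract describes it, this kind of generation would be preceded by a clustering step, which the sketch leaves out.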

    Geometric SMOTE for imbalanced datasets with nominal and continuous features

    Fonseca, J., & Bacao, F. (2023). Geometric SMOTE for imbalanced datasets with nominal and continuous features. Expert Systems with Applications, 234(December), 1-9. [121053]. https://doi.org/10.1016/j.eswa.2023.121053 --- This research was supported by research grants of the Portuguese Foundation for Science and Technology (“Fundação para a Ciência e a Tecnologia”), references SFRH/BD/151473/2021, DSAIPA/DS/0116/2019, and by project UIDB/04152/2020 - Centro de Investigação em Gestão de Informação (MagIC).
    Imbalanced learning can be addressed in three different ways: resampling, algorithmic modifications and cost-sensitive solutions. Resampling, and specifically oversampling, is a more general approach than algorithmic and cost-sensitive methods. Since the proposal of the Synthetic Minority Oversampling TEchnique (SMOTE), various SMOTE variants and neural network-based oversampling methods have been developed. However, the options to oversample datasets with nominal and continuous features are limited. We propose Geometric SMOTE for Nominal and Continuous features (G-SMOTENC), based on a combination of G-SMOTE and SMOTENC. Our method modifies SMOTENC’s encoding and generation mechanism for nominal features while using G-SMOTE’s data selection mechanism to determine the center observation and k-nearest neighbors, and its generation mechanism for continuous features. G-SMOTENC’s performance is compared against SMOTENC’s along with two other baseline methods, a state-of-the-art oversampling method and no oversampling. The experiment was performed over 20 datasets with varying imbalance ratios, numbers of metric and non-metric features, and target classes. We found a significant improvement in classification performance when using G-SMOTENC as the oversampling method. An open-source implementation of G-SMOTENC is made available in the Python programming language.

    Tabular and latent space synthetic data generation: a literature review

    Fonseca, J., & Bacao, F. (2023). Tabular and latent space synthetic data generation: a literature review. Journal of Big Data, 10, 1-37. [115]. https://doi.org/10.1186/s40537-023-00792-7 --- This research was supported by two research grants of the Portuguese Foundation for Science and Technology (“Fundação para a Ciência e a Tecnologia”), references SFRH/BD/151473/2021 and DSAIPA/DS/0116/2019, and by project UIDB/04152/2020 - Centro de Investigação em Gestão de Informação (MagIC).
    The generation of synthetic data can be used for anonymization, regularization, oversampling, semi-supervised learning, self-supervised learning, and several other tasks. Such broad potential motivated the development of new algorithms, specialized in data generation for specific data formats and Machine Learning (ML) tasks. However, one of the most common data formats used in industrial applications, tabular data, is generally overlooked: literature analyses are scarce, state-of-the-art methods are spread across domains or ML tasks, and there is little to no distinction among the main types of mechanism underlying synthetic data generation algorithms. In this paper, we analyze tabular and latent space synthetic data generation algorithms. Specifically, we propose a unified taxonomy as an extension and generalization of previous taxonomies, review 70 generation algorithms across six ML problems, distinguish the main generation mechanisms identified into six categories, describe each type of generation mechanism, discuss metrics to evaluate the quality of synthetic data and provide recommendations for future research. We expect this study to help researchers and practitioners identify relevant gaps in the literature and design better and more informed practices with synthetic data.

    How does the pandemic facilitate mobile payment? An investigation on users’ perspective under the COVID-19 pandemic

    Zhao, Y., & Bacao, F. (2021). How does the pandemic facilitate mobile payment? An investigation on users’ perspective under the COVID-19 pandemic. International Journal of Environmental Research and Public Health, 18(3), 1-22. [1016]. https://doi.org/10.3390/ijerph18031016
    Owing to the convenience, reliability and contact-free nature of mobile payment (M-payment), it has been widely adopted in China during the COVID-19 pandemic to reduce direct and indirect contact in transactions, allowing social distancing to be maintained and facilitating stabilization of the social economy. This paper aims to comprehensively investigate the technological and mental factors affecting users’ adoption intentions of M-payment under the COVID-19 pandemic, to expand the domain of technology adoption under emergency situations. This study integrated the Unified Theory of Acceptance and Use of Technology (UTAUT) with perceived benefits from Mental Accounting Theory (MAT), and two additional variables (perceived security and trust), to investigate 739 smartphone users’ adoption intentions of M-payment during the COVID-19 pandemic in China. The empirical results showed that users’ technological and mental perceptions conjointly influence their adoption intentions of M-payment during the COVID-19 pandemic, wherein perceived benefits are significantly determined by social influence and trust, corresponding with the pandemic situation. This study initially integrated UTAUT with MAT to develop the theoretical framework for investigating users’ adoption intentions. Meanwhile, this study originally investigated the antecedents of M-payment adoption under the pandemic situation and indicated that users’ perceptions will be positively influenced when a technology’s specific characteristics can benefit a particular situation.

    How does gender moderate customer intention of shopping via live-streaming apps during the COVID-19 pandemic lockdown period?

    Zhao, Y., & Bacao, F. (2021). How does gender moderate customer intention of shopping via live-streaming apps during the COVID-19 pandemic lockdown period? International Journal of Environmental Research and Public Health, 18(24), 1-24. [13004]. https://doi.org/10.3390/ijerph182413004
    Shopping through Live-Streaming Shopping Apps (LSSAs), an emerging consumption phenomenon, has increased dramatically in recent years, especially during the COVID-19 lockdown period. However, few studies have focused on the psychological processes undergone by different customer demographics while shopping via LSSAs under pandemic conditions. This study integrated the Unified Theory of Acceptance and Use of Technology 2 with Flow Theory into a Stimulus-Organism-Response framework to investigate the psychological processes of different customer demographics during the COVID-19 lockdown period. A total of 374 validated data points were analyzed by covariance-based structural equation modelling. The statistical results of the proposed model showed a significant discrepancy between gender groups: Flow, as a mediator representing users’ engagement and immersion in shopping via LSSAs, was significantly moderated by gender with respect to the connections between the stimulus components (hedonic motivation, trust and social influence) and the response component (perceived value). This study contributed a theoretical development and a practical framework to the explanation of the mental processes of different customer demographics when using an innovative e-commerce technology. Furthermore, the results can support the relevant stakeholders in e-commerce in their comprehensive understanding of customers’ behavior, allowing better strategic and managerial development.

    Advanced Genetic Programming vs. State-of-the-Art AutoML in Imbalanced Binary Classification

    The objective of this article is to provide a comparative analysis of two novel genetic programming (GP) techniques, differentiable Cartesian genetic programming for artificial neural networks (DCGPANN) and geometric semantic genetic programming (GSGP), with state-of-the-art automated machine learning (AutoML) tools, namely Auto-Keras, Auto-PyTorch and Auto-Sklearn. While all these techniques are compared to several baseline algorithms upon their introduction, research still lacks direct comparisons between them, especially of the GP approaches with state-of-the-art AutoML. This study intends to fill this gap in order to analyze the true potential of GP for AutoML. The performances of the different tools are assessed by applying them to 20 benchmark datasets from the field of imbalanced binary classification, an area that poses a frequent and challenging problem. The tools are compared across four categories: average performance, maximum performance, standard deviation within performance, and generalization ability, with the metrics F1-score, G-mean, and AUC used for evaluation. The analysis finds that the GP techniques, while unable to completely outperform state-of-the-art AutoML, are indeed already a very competitive alternative. Therefore, these advanced GP tools prove that they are able to provide a new and promising approach for practitioners developing machine learning (ML) models. DOI: 10.28991/ESJ-2023-07-04-021
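Two of the evaluation metrics named in this abstract, F1-score and G-mean, can be computed directly from a binary confusion matrix. The sketch below uses the standard textbook definitions (F1 as the harmonic mean of precision and recall, G-mean as the geometric mean of sensitivity and specificity); the function name `f1_and_gmean` and the toy labels are hypothetical and do not come from the study.

```python
import math


def f1_and_gmean(y_true, y_pred):
    """Return (F1, G-mean) for binary labels, with 1 as the minority class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    # Guard each ratio against an empty denominator
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0  # sensitivity
    specificity = tn / (tn + fp) if tn + fp else 0.0

    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    gmean = math.sqrt(recall * specificity)
    return f1, gmean


# Toy imbalanced example: 3 positives, 7 negatives (assumed data)
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]
f1, gmean = f1_and_gmean(y_true, y_pred)
```

Both metrics depend on minority-class recall, which is why they are commonly preferred over plain accuracy on imbalanced data: a classifier that predicts only the majority class scores zero on each.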

    A numeric-based machine learning design for detecting organized retail fraud in digital marketplaces

    Mutemi, A., & Bacao, F. (2023). A numeric-based machine learning design for detecting organized retail fraud in digital marketplaces. Scientific Reports, 13(1), 1-16. [12499]. https://doi.org/10.1038/s41598-023-38304-5
    Organized retail crime (ORC) is a significant issue for retailers, marketplace platforms, and consumers. Its prevalence and influence have increased rapidly in lockstep with the expansion of online commerce, digital devices, and communication platforms. Today, it is a costly affair, wreaking havoc on enterprises’ overall revenues and continually jeopardizing community security. These negative consequences are set to rocket to unprecedented heights as more people and devices connect to the Internet. Detecting and responding to these acts as early as possible is critical for protecting consumers and businesses while also keeping an eye on rising patterns and fraud. The issue of detecting fraud in general has been studied widely, especially in financial services, but studies focusing on organized retail crime are extremely rare in the literature. To contribute to the knowledge base in this area, we present a scalable machine learning strategy for detecting and isolating ORC listings posted on a prominent marketplace platform by merchants committing organized retail crimes or fraud. We employ a supervised learning approach to classify postings as fraudulent or real based on past data from buyer and seller behaviors and transactions on the platform. The proposed framework combines bespoke data preprocessing procedures, feature selection methods, and state-of-the-art class asymmetry resolution techniques to search for aligned classification algorithms capable of discriminating between fraudulent and legitimate listings in this context. Our best detection model obtains a recall score of 0.97 on the holdout set and 0.94 on the out-of-sample testing data set. We achieve these results based on a select set of 45 features out of 58.

    Theoretical Development: Extending the Flow Theory with Variables from the UTAUT2 Model

    Zhao, Y., & Bacao, F. (2020). Theoretical Development: Extending the Flow Theory with Variables from the UTAUT2 Model. In 2020 IEEE 6th International Conference on Computer and Communications, ICCC 2020 (pp. 2427-2431). [9345049] (2020 IEEE 6th International Conference on Computer and Communications, ICCC 2020). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ICCC51575.2020.9345049
    Owing to the dramatic worldwide development of innovative information technology, the business climate has shifted from traditional commerce to virtual commerce over the last two decades. It is important to understand customers' intention to adopt new technology in an integrated way, for better business management and strategy involving information technology. Thus, this study extends the Flow theory by integrating variables from the revised Unified Theory of Acceptance and Use of Technology 2 (UTAUT2) model and satisfaction to propose a theoretical development for investigating the factors determining customers' behavioral intention to adopt new information technology. In addition, the proposed theoretical development contributes to the relevant research on systematically understanding customers' adoption intention, from technological perceptions to mental cognition. Moreover, the proposed framework and measurement method can serve as a reference for relevant researchers and stakeholders investigating customers' behaviors, for further research and future business management and strategy.

    Improving Active Learning Performance through the Use of Data Augmentation

    Fonseca, J., & Bacao, F. (2023). Improving Active Learning Performance through the Use of Data Augmentation. International Journal of Intelligent Systems, 2023, 1-17. https://doi.org/10.1155/2023/7941878 --- Funding: This research was supported by three research grants of the Portuguese Foundation for Science and Technology (“Fundação para a Ciência e a Tecnologia”): SFRH/BD/151473/2021 - MIT Portugal PhD Grant; DSAIPA/DS/0116/2019, and PCIF/SSI/0102/2017.
    Active learning (AL) is a well-known technique to optimize data usage in training, through the interactive selection of unlabeled observations, out of a large pool of unlabeled data, to be labeled by a supervisor. Its focus is to find the unlabeled observations that, once labeled, will maximize the informativeness of the training dataset, therefore reducing data-related costs. The literature describes several methods to improve the effectiveness of this process. Nonetheless, there is a paucity of research developed around the application of artificial data sources in AL, especially outside image classification or NLP. This paper proposes a new AL framework, which relies on the effective use of artificial data. It may be used with any classifier, generation mechanism, and data type and can be integrated with multiple other state-of-the-art AL contributions. This combination is expected to increase the ML classifier’s performance and reduce both the supervisor’s involvement and the amount of required labeled data at the expense of a marginal increase in computational time. The proposed method introduces a hyperparameter optimization component to improve the generation of artificial instances during the AL process as well as an uncertainty-based data generation mechanism. We compare the proposed method to the standard framework and an oversampling-based active learning method for more informed data generation in an AL context. The models’ performance was tested using four different classifiers, two AL-specific performance metrics, and three classification performance metrics over 15 different datasets. We demonstrated that the proposed framework, using data augmentation, significantly improved the performance of AL, both in terms of classification performance and data selection efficiency (all the code and preprocessed data developed for this study are available at https://github.com/joaopfonseca/publications/).
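The selection step of a standard active learning loop, which this abstract takes as its baseline, can be sketched as follows. The example ranks unlabeled observations by the entropy of the classifier's predicted class probabilities and queries the most uncertain ones. It illustrates generic uncertainty sampling only, not the paper's data augmentation framework; the function names and the probability values are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy of a predicted class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_most_uncertain(pool_probs, n_queries=2):
    """Return indices of the n_queries most uncertain pool observations."""
    ranked = sorted(
        range(len(pool_probs)),
        key=lambda i: entropy(pool_probs[i]),
        reverse=True,  # highest entropy (most uncertain) first
    )
    return ranked[:n_queries]


# Assumed classifier outputs for four unlabeled observations
pool_probs = [[0.95, 0.05], [0.55, 0.45], [0.70, 0.30], [0.50, 0.50]]
queried = select_most_uncertain(pool_probs)  # indices sent to the supervisor
```

In this toy pool the observations at indices 3 and 1 are selected, since their predicted probabilities are closest to uniform. In a full loop, the queried observations would be labeled by the supervisor, added to the training set, and the classifier refitted before the next iteration.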